Survey on Comparable Corpora until June 2012
Abstract
Here we present a survey of important work on comparable corpora published between 1995 and 2012. Unlike parallel corpora, which are clearly defined as translated texts, comparable texts vary widely in their degree of non-parallelism. Non-parallelism manifests as differences in author, domain, topic, time period, and language. The most common text corpora are non-parallel in all of these dimensions. The higher the degree of non-parallelism, the more challenging the extraction of bilingual information becomes. Such a corpus is nevertheless a desirable source of bilingual information, especially for new words. In this report we first classify the research on comparable corpora into categories. This is followed by a detailed literature survey on comparable corpora and comparability metrics. We then discuss work on enhancing comparability metrics for corpora. We conclude with a brief summary of the survey.

1 Classification of Work on Comparable Corpora

We can broadly classify the research on comparable corpora into the following categories.
• Correlation based extraction
• Vector representation
• Classifiers based extraction
• Linguistic knowledge based extraction
Each of these classes is described in the following sections, along with the gist of the pioneering work in that area.

2 Correlation based Extraction

Most of the work on comparable corpora is based on correlation between word co-occurrences. These approaches treat the context of a word as a feature for mapping a source word to a target word. Moreover, most of the work based on this idea focuses on extracting bilingual lexicons rather than parallel sentences.
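The idea of using a word's context as a feature for mapping source words to target words can be illustrated with a minimal sketch. This is not the method of any particular surveyed paper; it is a generic context-vector scheme under stated assumptions: a small seed dictionary is available, context is a fixed token window, and similarity is cosine. The function names, the toy Spanish and English sentences, and the seed dictionary are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def project(vector, seed_dict):
    """Rewrite a source context vector in target-language dimensions.

    Context words missing from the seed dictionary are dropped.
    """
    projected = Counter()
    for ctx_word, count in vector.items():
        if ctx_word in seed_dict:
            projected[seed_dict[ctx_word]] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_translations(word, src_vectors, tgt_vectors, seed_dict):
    """Rank target words by context similarity to the given source word."""
    projected = project(src_vectors.get(word, Counter()), seed_dict)
    scores = [(cand, cosine(projected, vec)) for cand, vec in tgt_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: two tiny monolingual corpora and a seed dictionary.
# "perro" and "gato" are deliberately absent from the seed dictionary.
src_corpus = [["el", "perro", "come", "carne"], ["el", "gato", "bebe", "leche"]]
tgt_corpus = [["the", "dog", "eats", "meat"], ["the", "cat", "drinks", "milk"]]
seed = {"el": "the", "come": "eats", "carne": "meat", "bebe": "drinks", "leche": "milk"}

src_vecs = context_vectors(src_corpus)
tgt_vecs = context_vectors(tgt_corpus)
best_for_perro = rank_translations("perro", src_vecs, tgt_vecs, seed)[0][0]
best_for_gato = rank_translations("gato", src_vecs, tgt_vecs, seed)[0][0]
```

The toy data is built so that the unseen words "perro" and "gato" share all their translated context words with "dog" and "cat" respectively, so those candidates rank first. On real comparable corpora the contexts only partially overlap, which is why the degree of non-parallelism matters for extraction quality.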
Similar resources
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but due to the lack of such texts, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
Measuring Comparability of Multilingual Corpora Extracted from Wikipedia / Midiendo la comparabilidad de corpus multilingües extraídos de la Wikipedia
Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia.
Ninth Workshop on Building and Using Comparable Corpora Workshop Programme
Comparable corpora are the most versatile and valuable resource for multilingual Natural Language Processing. The speaker will argue that comparable corpora can support a wider range of applications than has been demonstrated so far in the state of the art. The talk will present completed and ongoing work conducted by the speaker and colleagues from his research group where comparable corpora a...
Revising the Compositional Method for Terminology Acquisition from Comparable Corpora
In this paper, we present a new method that improves the alignment of equivalent terms monolingually acquired from bilingual comparable corpora: the Compositional Method with Context-Based Projection (CMCBP). Our overall objective is to identify and to translate highly specialized terminology made up of multi-word terms acquired from comparable corpora. Our evaluation in the medical domain and fo...
Wikipedia as an SMT Training Corpus
This article reports on mass experiments supporting the idea that data extracted from strongly comparable corpora may successfully be used to build statistical machine translation systems of reasonable translation quality for in-domain new texts. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on large bilingual corpora of simil...
Publication date: 2012